/* This .do file generates the dataset titled LeeLee_enrollment_BMR_democracy_revised.dta, which should be used to produce most of the figures and tables of: 

Paglayan, Agustina S. 2021. "The Non-Democratic Roots of Mass Education: Evidence from 200 Years." American Political Science Review 115(1): 179-198. First published online by Cambridge University Press, 11 September 2020.

A previous dataset titled Education_LeeLee_Democracy_MAIN.dta was used to produce several figures contained in the original article. That dataset contained two coding errors: (a) Barbados, Belize, Iceland, and Malta were classified as non-democracies using Polity, when in fact they should be coded as missing; and (b) many countries were incorrectly coded as not having universal male suffrage (UMS) from 2001-2010 using PIPE, when in fact they should be coded as having UMS. This .do file corrects those errors. It also includes information about the assumptions made when merging Lee and Lee’s school enrollment dataset with the three datasets used to measure the timing of democratization. 

Although correcting these coding errors does not alter the conclusions of the above article, the dataset generated by this .do file (dataset LeeLee_enrollment_BMR_democracy_revised.dta) is the correct one and should replace the previous dataset (Education_LeeLee_Democracy_MAIN.dta). 

Before executing the code, download the following files in your selected path/ directory:
LeeLee_enroll_MF_forSTATA.csv
First year with Primary enrollment data available.csv
BoixMillerRosato_democracy-v2.0_1.dta
PIPE_081813_Przeworski_democracy.dta
PolityIV_v2016.csv

This code was generated with Stata version 16.

*/

clear all

global path "C:\Users\pagla\Dropbox\Stanford\Dissertation\Data\"  	// AGUSTINA'S DATA ACCESS PATH 
global saves "C:\Users\pagla\Dropbox\Stanford\Dissertation\Civil War, State Consolidation, and the Spread of Mass Education\Book\The Non-Democratic Origins of Education\APSR Replication (revised)\"  	// AGUSTINA'S SAVE PATH 
cd "${path}"	

* -------------------------------------------------
* LEE & LEE: ENROLLMENT DATA
* -------------------------------------------------

insheet using "${path}Lee and Lee Enrollment Ratios and Educational Attainment, 1820-2010\LeeLee_enroll_MF_forSTATA.csv", names case comma 

keep if Year!=.
keep if ccode!=.  	// drops Reunion and Hong Kong
rename Year year
tsset ccode year
tsfill
bysort ccode: ipolate Primary year, gen(primratio)
bysort ccode: ipolate Secondary year, gen(secratio)
bysort ccode: ipolate Tertiary year, gen(tertratio)
by ccode (Country), sort: assert Country == Country[_N] | missing(Country)
by ccode (Country): replace Country = Country[_N] // added this to fill in the country names here since you do end up merging on country below and not on ccode 
sort ccode year

gen region=.
replace region=1 if Region=="Advanced Economies"
replace region=2 if Region=="Asia and the Pacific"
replace region=3 if Region=="Eastern Europe"
replace region=4 if Region=="Latin America and the Caribbean"
replace region=5 if Region=="Middle East and North Africa"
replace region=6 if Region=="Sub-Saharan Africa"
bysort ccode: egen NewRegion=min(region)
keep Country ccode year Primary Secondary Tertiary primratio secratio tertratio Region NewRegion
sort Country year
save "${saves}LeeLee_enrollment_BMR_democracy_revised.dta", replace


* -------------------------------------------------
* MERGE WITH YEAR FIRST PRIMARY ENROLLMENT DATA
* -------------------------------------------------

/* 
Although the Lee and Lee dataset in principle contains school enrollment rates for all countries from 1820 to 2010, many of the enrollment rates that appear in the dataset for earlier years, including most of the data preceding 1900, are based not on actual enrollment statistics but on backward extrapolations done by Lee and Lee. The variable Year_First_Primary_Data indicates, for each country, the first year for which the primary school enrollment rates reported in Lee and Lee’s dataset were computed using actual primary school enrollment statistics (i.e., enrollment rates before this year are based on Lee and Lee’s backward extrapolations). I created the variable Year_First_Primary_Data using the information provided by Lee and Lee in their sources appendix.
*/

clear
insheet using "${path}Lee and Lee Enrollment Ratios and Educational Attainment, 1820-2010\First year with Primary enrollment data available.csv", names case comma
sort Country
merge 1:m Country using "${saves}LeeLee_enrollment_BMR_democracy_revised.dta"
sort ccode year
drop if ccode==. 	

label var Year_First_Primary_Data "First Year with Primary School Enrollment Data"
drop _merge
bysort ccode: egen Year_Last_Primary_Data=max(year) if year != .  // creates a variable denoting last year for which there is data for a country
save "${saves}LeeLee_enrollment_BMR_democracy_revised.dta", replace

collapse (mean) Year_First_Primary_Data Year_Last_Primary_Data, by (Country ccode)
export excel using   "${saves}Merge LeeLee with democracy datasets.xlsx", sheet("KN") sheetreplace firstrow(variables) // exporting list to Excel to cross-check 


* -------------------------------------------------
* MERGE WITH BOIX, MILLER AND ROSATO
* -------------------------------------------------

use "${path}BoixMillerRosato_democracy-v2.0_1.dta", clear
sort ccode year
keep country ccode year democracy sovereign

* Export list to Excel
bysort country: egen Year_First_BMR=min(year) if year!=.
bysort country: egen Year_Last_BMR=max(year) if year !=.
preserve
collapse (mean) Year_First_BMR Year_Last_BMR, by (country ccode)
export excel using   "${saves}Merge LeeLee with democracy datasets.xlsx", sheet("BMR") sheetreplace firstrow(variables) // exporting list to Excel before any changes made to ccodes, etc. 
restore 

* Data cleaning:

/* 
A central goal of Paglayan (2021) is to examine the relationship between education and regime type over a much longer time period than what had been feasible in previous cross-national studies, which relied on education data compiled by UNESCO and available only from 1960s onwards. To accomplish this goal, the analysis must retain as much data on school enrollment rates from Lee and Lee as possible. In the case of seven countries, this requires making an assumption about what is the country unit that the enrollment rates refer to. Specifically, while Lee and Lee provide school enrollment rates from 1820-2010 for seven countries that they call "Czech Republic," "Germany," "Pakistan," "Republic of Korea," "Russia," "Serbia," and "Yemen," these countries did not exist with this exact name throughout the entire period 1820-2010. The following code implements and explains the assumptions made.
*/

replace ccode=315 if ccode==316 	// Merges Czech Republic in Lee & Lee with Czechoslovakia in BMR from 1918-92 and with Czech Republic in BMR from 1993-2010. 
replace ccode=255 if ccode==265 & year>1945 & year<1990 // Merges Germany in Lee & Lee with Germany in BMR from 1820-1945 and 1990-2010, and with East Germany in BMR from 1946-1989. (Assumes Germany not democratic between 1946-1989.)  
replace ccode=770 if ccode==769 & year>=1947 & year<=1971 // Merges Pakistan in Lee & Lee with Pakistan (incl. Banglad.) in BMR from period 1947-1971 and with Pakistan in BMR from 1972-2010. 
replace ccode=730 if ccode==732 & year>=1948 & year<=2010 // Merges Republic of Korea in Lee & Lee with Korea in BMR from 1820-1910 and with South Korea in BMR from 1948-2010. Korea under Japanese rule in 1910-1945; not a democracy. 
replace ccode=365 if ccode==364 & year>=1922 & year<=1991 // Allows merge with all data from Russia in Lee and Lee (Russian Federation in BMR from 1800-1921 and 1992-2010; USSR in BMR from 1922-1991) 
replace ccode=342 if ccode==345 & year>=1921 & year<=1991 
replace ccode=342 if ccode==347 & year>=1992 & year<=2005 // Last two lines allow merge of Serbia in Lee & Lee with the following countries/ periods in BMR: Serbia from 1830-1920, Yugoslavia with ccode 345 from 1921-1991, Yugoslavia with ccode 347 from 1992-2005, Serbia from 2006-2010. 
replace ccode=678 if ccode==679 & year>=1990 & year<=2010 // Allows merge with all data from Yemen in Lee and Lee (Yemen non-democratic for the full period regardless of how it is merged.) 
sort ccode year

* Generate temp variables (NOTE: My measure of democracy assumes that being independent (sovereign==1) is a necessary condition for citizens to have the freedom to choose their own representatives; therefore, countries that are not independent (sovereign==0) are considered non-democratic):

gen BMRtemp=1
gen democBMRtemp=.
replace democBMRtemp=1 if democracy==1 & sovereign==1  
replace democBMRtemp=0 if (democracy==0 & sovereign==1) | sovereign==0 
label var democBMRtemp "Democracy (BMR) temp"
bysort ccode: egen YeardemocBMR=min(year) if democBMRtemp==1
bysort ccode: egen YearFirstDemocBMR=mean(YeardemocBMR)
label var YearFirstDemocBMR "Year First Democracy (BMR) temp"

bysort ccode: egen YearSovereign=min(year) if sovereign==1
bysort ccode: egen YearFirstSovereigntemp=mean(YearSovereign)
label var YearFirstSovereigntemp "Year First Sovereign (BMR) temp"

* Merge:

merge 1:1 ccode year using "${saves}LeeLee_enrollment_BMR_democracy_revised.dta"

* Generate variables: 

bysort ccode: egen BMR=mean(BMRtemp)

bysort ccode: egen YearFirstSovereign=mean(YearFirstSovereigntemp)
label var YearFirstSovereign "Year First Sovereign (BMR)"
bysort ccode: egen YearFirstDemocracy=mean(YearFirstDemocBMR)
label var YearFirstDemocracy "Year First Democracy (BMR)"

sort Country year
gen democBMR=democBMRtemp 
replace democBMR=0 if sovereign!=1 & year<YearFirstDemocracy & BMR==1 // 7,905 obs. 

gen democBMR2=.
replace democBMR2=1 if democracy==1 & sovereign==1  
replace democBMR2=0 if (democracy==0 & sovereign==1) | sovereign==0 
replace democBMR2=0 if year<YearFirstSovereign & YearFirstSovereign!=. & BMR==1
replace democBMR2=0 if sovereign==. & year<YearFirstDemocracy & YearFirstDemocracy!=. & YearFirstSovereign!=. & BMR==1

drop _merge
drop if primratio==.  

save "${saves}LeeLee_enrollment_BMR_democracy_revised.dta", replace


* -------------------------------------------------
* MERGE WITH PIPE DATASET
* -------------------------------------------------

use "${path}PIPE_081813_Przeworski_democracy.dta", clear
rename cowcodes ccode
sort ccode year
keep country countryn ccode f year id id2 supra
drop if country==. 			// these are empty rows and can be dropped
sort ccode year
bysort country (ccode): replace ccode = ccode[1] if missing(ccode) 		// assign missing ccodes based on assumption that countries with same PIPE country id (variable called country) would have the same ccode

* Export list to Excel before data cleaned
bysort countryn: egen Year_First_PIPE=min(year) if year!=.
bysort countryn: egen Year_Last_PIPE=max(year) if year!=.
preserve
collapse (mean) Year_First_PIPE Year_Last_PIPE, by (countryn country ccode)
export excel using   "${saves}Merge LeeLee with democracy datasets.xlsx", sheet("PIPE") sheetreplace firstrow(variables)
restore 

sort ccode year

* Data cleaning: (see the comment on line 104)

drop if year>2010
replace ccode=750 if country==521 // Merges India in Lee&Lee with British India in PIPE from 1820-1945 and with India in PIPE from 1946-2006.
replace ccode=315 if country==155 
drop if countryn == "Czech Republic" & year == 1990
drop if countryn == "Czech Republic" & year == 1991
drop if countryn == "Czech Republic" & year == 1992 // Merges Czech Republic in Lee&Lee with Czechoslovakia (1820-1992) and with Czech Republic (1993-2008) in PIPE. Czechoslovakia dissolved on Dec 31, 1992, so dropped observations for Czech Republic from 1990-1992 so no overlap. 
replace ccode = 730 if country == 87 		// Merges Republic of Korea in Lee & Lee with South Korea (1820-1944) and Korea, South (1945-2008) in PIPE. 
replace ccode = 678 if country == 100 & year<1990
replace ccode = 678 if country == 190 & year>=1990 // see note below on Yemen 
drop if ccode == 255 & year>1945 & year <1990
replace ccode = 255 if ccode == 265 // see note below on Germany 
replace ccode=342 if ccode == 345  	
drop if country == 192 & year == 1990
drop if country == 192 & year == 1991 // Merges Serbia in Lee & Lee with Yugoslavia & Serbia in PIPE (UMS in 1920). 
drop if countryn == "Vietnam, North" // duplicates are ccode 816 (1954-1975) which is from North Vietnam & Vietnam. Vietnam is NOT in Lee & Lee, but dropped the North Vietnam observations and kept Vietnam observations for smooth merging.
drop if country == 93 & year<1971 	// Merges Pakistan in Lee & Lee with Pakistan including Bangladesh in PIPE from 1947-1970 and with Pakistan in PIPE from 1971-2008.
drop if ccode==.  // Drops East Timor, Oman, Tuvalu and Leeward Islands, etc., which are not in Lee & Lee
drop if country == 125 & year == 1820
drop if country == 125 & year == 1870
drop if country == 125 & year == 1913
// there are duplicates in the Turkey/Ottoman empire data for years 1820, 1870, and 1913. For 1820 and 1870, the value of "f" doesn't change based on which observation you are looking at so I deleted at random. For 1913, one observation has "f" as "." and another has "f" as "7". I kept the observation with a 7, since Turkey had previously had a 7, but this can be easily changed. 
drop if country==100 & year==1990 // Duplicate for Yemen in 1990 (country==190 and country==100). f is missing for country==100; I kept country==190.
* NOTE: PIPE has data for West Germany, East Germany, and Germany for 1946-1989. I used the East Germany data for the merge during the period in which the country was split to be consistent with BMR and Polity. I did something similar for Yemen, which had data for Yemen the supra-unit, South Yemen, and North Yemen. I used the data for North Yemen to be consistent with BMR and Polity, since they don't have supra-unit data for when the country split.

bysort ccode: egen Year_First_PIPE2=min(year) if year!=.

* Generate temp variables:

gen PIPEtemp=1

gen msuftemp=(f==7 | f==71 | f==72)
replace msuftemp=. if f==. 
label var msuftemp "Univ. Male Suffrage temp"

bysort ccode: egen Yearmsuf=min(year) if msuftemp==1
bysort ccode: egen YearManhoodSuffragetemp=mean(Yearmsuf)
label var YearManhoodSuffragetemp "Year Universal Male Suffrage temp"

gen fsuftemp=(f==72)
replace fsuftemp=. if f==.
label var fsuftemp "Univ. Female Suffrage temp"

bysort ccode: egen Yearfsuf=min(year) if fsuftemp==1
bysort ccode: egen YearFemaleSuffragetemp=mean(Yearfsuf)
label var YearFemaleSuffragetemp "Year Universal Female Suffrage temp"

* Merge: 

merge 1:1 ccode year using "${saves}LeeLee_enrollment_BMR_democracy_revised.dta"
drop _merge 

* Generate variables:

bysort ccode: egen PIPE=mean(PIPEtemp)

bysort ccode: egen Year_First_PIPE3=mean(Year_First_PIPE2)		// Year_First_PIPE3 is a measure of sovereignty; according to the PIPE codebook, "The data set is intended to cover all countries that were or would become independent at any time after 1917, each from the inception of the first representative institutions or from the date of independence, whichever comes first, through 2008."

bysort ccode: egen YearManhoodSuffrage=mean(YearManhoodSuffragetemp)
label var YearManhoodSuffrage "Year Universal Male Suffrage"
bysort ccode: egen YearFemaleSuffrage=mean(YearFemaleSuffragetemp)
label var YearFemaleSuffrage "Year Universal Female Suffrage"

gen msuf=msuftemp
sort Country year 
replace msuf=0 if msuftemp==. & PIPE==1 & year<YearManhoodSuffrage & YearManhoodSuffrage!=. & year>=Year_First_PIPE3

gen fsuf=fsuftemp
replace fsuf=0 if fsuftemp==. & PIPE==1 & year<YearFemaleSuffrage & YearFemaleSuffrage!=. & year>=Year_First_PIPE3

drop if primratio==.  	// drops observations in PIPE but not in Lee & Lee, and observations in both datasets but for year<1820

save "${saves}LeeLee_enrollment_BMR_democracy_revised.dta", replace


* -------------------------------------------------
* MERGE WITH POLITY IV v2016
* -------------------------------------------------

clear
insheet using "${path}PolityIV_v2016.csv"
sort ccode year
rename country country_Polity

keep country_Polity ccode year polity2 polity

* Export list to Excel
bysort ccode: egen Year_First_Polity=min(year) if year!=.
bysort ccode: egen Year_Last_Polity=max(year) if year !=.
preserve
collapse (mean) Year_First_Polity Year_Last_Polity, by(country_Polity ccode)
export excel using   "${saves}Merge LeeLee with democracy datasets.xlsx", sheet("Polity") sheetreplace firstrow(variables)
restore 

* Data cleaning: (see the comment on line 104)

replace ccode = 342 if ccode==347 & year>1991 & year<2006 
replace ccode = 342 if ccode==345 // Last two lines of code merge Serbia in Lee & Lee with Yugoslavia in Polity for 1921-1991 and with Serbia/Montenegro in Polity for period 1992-2005.
replace ccode = 255 if ccode==265 & year>1945 & year<1990 // Germany data ends at 1945 and starts at 1990 again. Merge with East Germany for the missing years, as in BMR. Also note that ccode 255 is called "Prussia" until 1867 in the dataset.  
replace ccode=315 if ccode==316 // Merges Czech Republic and Czechoslovakia as in PIPE & BMR.
replace ccode=730 if ccode==732 // Merges South Korea with Korea data, as before. 
replace ccode=770 if ccode==769 // Merges two Pakistans, as in BMR. 
replace ccode=818 if ccode==817 // Merges Vietnam with South Vietnam, but Vietnam is NOT in Lee & Lee. 
drop if ccode==678 & year==1990
replace ccode=678 if ccode==679 // Last two lines of code merge Yemen with North Yemen, as before. Yemen reunified in May 1990.
drop if ccode==364 & year==1922
replace ccode=365 if ccode==364 & year>=1923 & year<=1991 // Merges Russia in Lee & Lee with USSR in Polity from 1923-1991.

replace ccode = 530 if ccode == 529 & year>=1993 & year!=. // merging Ethiopia for ease of overall merge, but Ethiopia is NOT in Lee & Lee
drop if ccode==530 

* Generate temp variables:

gen Politytemp=1

bysort ccode: egen Year_First_Polity2=min(year) if year!=. 	// Year_First_Polity2 is a measure of sovereignty; according to the PolityIV codebook, "As presently constituted, the Polity IV dataset encompasses 162 contemporary countries (as of December 31, 2006), including all countries where the 2006 population exceeds five hundred thousand. The long-established members of the international system are coded beginning in 1800. More recently established countries are coded from the year in which their first independent government was formed, that is, beginning with the official date of the country’s independence."

* Merge:
 
keep country_Polity year ccode polity2 polity Year_First_Polity Year_First_Polity2 Politytemp
merge 1:1 ccode year using "${saves}LeeLee_enrollment_BMR_democracy_revised.dta"

* Generate variables:

bysort ccode: egen Polity=mean(Politytemp)
bysort ccode: egen Year_First_Polity3=mean(Year_First_Polity2) // Year_First_Polity3 is a measure of sovereignty; see notes above.

gen polity2b=. 
replace polity2b=1 if polity2>=6 & polity2<=10
replace polity2b=0 if polity2>=-10 & polity2<=5 
replace polity2b=0 if polity2==. & Polity==1 & year<Year_First_Polity3 & Year_First_Polity3!=. // code as non-democratic if the country is included in PolityIV but the country-year corresponds to a year preceding the country's independence. Assumes that being independent is a necessary condition for citizens to have the freedom to choose their own representatives, and therefore, countries that are not independent are not democratic. 
replace polity2b=0 if polity==-66   	// Following the same rationale, code as non-democratic those countries that are under foreign occupation.  

gen Politysovereign=.
replace Politysovereign=1 if year>=Year_First_Polity3 & Year_First_Polity3!=. 
replace Politysovereign=0 if year<Year_First_Polity3 & Year_First_Polity3!=. 

bysort ccode: egen YearFirstDemocPolity=min(year) if polity2b==1
bysort ccode: egen YearFirstPolityDemoc=mean(YearFirstDemocPolity)
label var YearFirstPolityDemoc "Year First Democracy (Polity2>=6 & <=10)"

drop if year<1820 | year>2010
drop if primratio==.
drop _merge
save "${saves}LeeLee_enrollment_BMR_democracy_revised.dta", replace


*--------------------------------------------------------------------------------------------------------------------------------------------------
* Generate absorbing treatment variables (treatment equals 1 for first year and remains 1 for all subsequent years). Relevant variable names to creat absorbing variable: YearFirstDemocracy, YearFirstPolityDemoc, YearManhoodSuffrage, and YearFemaleSuffrage
*--------------------------------------------------------------------------------------------------------------------------------------------------

gen democBMR_abs=. 
replace democBMR_abs=1 if year>=YearFirstDemocracy 
replace democBMR_abs=0 if BMR==1 & (year<YearFirstDemocracy | YearFirstDemocracy==.)

gen polity2b_abs=.
replace polity2b_abs=1 if year>=YearFirstPolityDemoc 
replace polity2b_abs=0 if Polity==1 & (year<YearFirstPolityDemoc | YearFirstPolityDemoc==.)

gen msuf_abs=.
replace msuf_abs=1 if year>=YearManhoodSuffrage
replace msuf_abs=0 if PIPE==1 & (year<YearManhoodSuffrage | YearManhoodSuffrage==.)

gen fsuf_abs=.
replace fsuf_abs=1 if year>=YearFemaleSuffrage
replace fsuf_abs=0 if PIPE==1 & (year<YearFemaleSuffrage | YearFemaleSuffrage==.)

save "${saves}LeeLee_enrollment_BMR_democracy_revised.dta", replace
